Phylogenetics is the study of the evolutionary history and relationships among individuals or groups of organisms
Phylogenetics goes back to the origins of evolutionary thinking:
> Darwin drew this tree
There are many, many applications of phylogenetics.
There are a lot of learning materials online. Here’s an intro course from EBI: https://www.ebi.ac.uk/training/online/course/introduction-phylogenetics
A phylogenetic tree is a tree structure (with nodes and edges) in which tips correspond to organisms and internal nodes correspond to inferred common ancestors
The root is usually an inferred oldest node, or a node separating your study organisms from an “outgroup”. A tree need not have a root.
load("demotree.Rdata")
library(ape)
plot(tr, edge.width = 2, edge.col = "grey",
     tip.col = c(rep("grey", 3), rep("blue", 3), rep("grey", 6)))
If we are studying the blue organisms we might want to root the tree this way (but we don’t have to – it depends on the question):
tr <- root(phy = tr, outgroup = c("Organism 4", "Organism 5", "Organism 6"))
plot(tr, edge.width = 2, edge.col = "grey",
     tip.col = c(rep("grey", 3), rep("blue", 3), rep("grey", 6)),
     main = "The same tree, with a different root")
The horizontal axis shows branch lengths, in units of genetic distance – typically something like the number of substitutions per site in molecular (sequence) data. Sometimes we create a timed phylogenetic tree, whose branch lengths are in units of time.
The vertical axis – NOTHING!
In the above tree, Organism 11 is as closely related to Organism 8 as Organism 10 is to Organism 9.
Sister groups: two groups descending from the same ancestor
Clade: an ancestor and all its descendants. This is also sometimes called a monophyletic clade.
Common ancestor (of a set of tips): a node that is an ancestor of all tips in the set
Most recent common ancestor (MRCA) of a set of tips: the common ancestor of the set of tips that is farthest from the root
This tree shows groups of animals and also indicates inferred evolutionary events that happened on some of the branches of the tree:
So: is it practical to find the truly best tree among all the possible trees? (A: not really, but it works quite well. There is usually some uncertainty!)
Phylogenetic analysis usually proceeds in stages like the ones in this cartoon.
We’ll start from step 6 in this lecture. Not that the other steps aren’t important, but 1-3 are specific to your research question. 4 and 5 are in other areas of bioinformatics.
These days, now that sequencing technologies are so affordable, molecular (sequence) data are usually the key data for phylogenetic tree reconstruction.
Why use molecular data?
- DNA is inherited material
- We can now easily, quickly, inexpensively and reliably sequence genetic material
- Sequences are highly specific and rich in information
But we could use morphological or phenotype data
In fact we can also test how consistent trees from genetic and phenotype data are with each other
The characters are the entries of the aligned data. In molecular data, each character is a genetic site or locus. For example, suppose the multiple sequence alignment looks like this:
head demo.fas
>A130
atgaaaccct cgcgccctta tctttaccac gcgtactaca actggatttt agttaatgat
>A131
atgaaaccct tgcgccctta tcttttccac gcgtcctaca actggatttt agataatgat
>A132
atgaaaccct cgcgccctta tctttaccac gcgtactaca actcgatttt agataatgat
>A133
atgaaaccct tgcgccctta tctttgccac gcgtactaca actggatttt agataatgat
Here the organism names are “A130”, “A131”, “A132” and “A133”. There are 60 characters, written in 6 groups of 10. All organisms have an ‘a’ as the first character.
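Thinking of the alignment as a character matrix can be sketched in a few lines of Python (the dictionary representation and the `site` helper are mine, for illustration only; the sequences are the first 20 sites of the demo alignment above, spaces removed):

```python
# The aligned demo sequences, first 20 sites, as a name -> sequence mapping.
alignment = {
    "A130": "atgaaaccctcgcgccctta",
    "A131": "atgaaacccttgcgccctta",
    "A132": "atgaaaccctcgcgccctta",
    "A133": "atgaaacccttgcgccctta",
}

def site(aln, i):
    """Return character number i (1-based) of every organism: one column of the alignment."""
    return {name: seq[i - 1] for name, seq in aln.items()}

print(site(alignment, 1))   # every organism has 'a' at site 1
print(site(alignment, 11))  # site 11 varies: 'c' in A130/A132, 't' in A131/A133
```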
- Distance-based methods
Compute distances between pairs of sequences and find a tree (e.g. using clustering) that best describes these. Example: neighbour joining.
In fact the distances are still based on characters, but the characters aren’t considered one at a time - they are pooled together to create pairwise distances between all the sequences in the dataset.
- Character-based methods Consider sequences of (DNA, RNA, morphological) characters, one at a time, and seek the tree with an optimal “score”: likelihood or parsimony; or use a Bayesian method to compute a posterior collection of phylogenetic trees.
Can we get the best tree?
Either way, we need to understand how to compare sequences to each other! We need to either compute a distance between all the pairs of sequences, or to figure out a likelihood or score for a tree.
To do these comparisons, we use models that describe how sequences evolve.
How can we compute the distance between two sequences?
A common approach: the number of differences per site. For example, the two sequences “AACCTATCGG” and “AATCTAACGG” differ in two places, so they would be 2/10 = 0.2 substitutions per site apart.
In other words, if the number of differences is \(m\) and the total number of sites is \(n\) then this basic distance is \(d_s=\frac{m}{n}\).
However: what if there were actually 2 changes at the same site? This simple computation would actually underestimate the true distance.
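This basic distance is simple to compute. A minimal Python sketch (the function name `p_distance` is mine, not from the text), checked on the example pair above:

```python
def p_distance(seq1, seq2):
    """Basic distance d_s: fraction of aligned sites where two sequences differ."""
    assert len(seq1) == len(seq2), "sequences must be aligned (equal length)"
    m = sum(a != b for a, b in zip(seq1, seq2))  # number of differing sites
    return m / len(seq1)

print(p_distance("AACCTATCGG", "AATCTAACGG"))  # 2 differences / 10 sites = 0.2
```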
There are many models for sequence evolution. They define the probability that a character at a site will evolve into a different character at that site over a specified time period. They can take various complexities into account. Models of sequence evolution can be used to define distances between two sequences, and they can be used to simulate evolving sequences.
A, C, T and G: adenine, cytosine, guanine, thymine
A and G: 2-ring structures; purine
C and T: 1-ring structures; pyrimidine
Transition: a purine-purine change, or a pyrimidine-pyrimidine change (C-T or A-G)
Transversion: a pyrimidine-purine change (or vice versa)
The Jukes-Cantor model treats these as having the same probability
But it is reasonable to think that transitions might be more probable than transversions
Also, A, C, T and G might not be all equally frequent.
Other, more complex models take these differences into account. In the JC model, any possible change is equally likely; it is the simplest model of sequence evolution.
Conceptually, the link is this: the JC model (and higher-complexity variations) define the probability that one sequence evolves into another. The “distance” is related to the amount of time that this would take in the model.
The Jukes-Cantor model is the simplest, but they are all based on homogeneous continuous-time Markov chains. These are beyond the scope of this course. The wikipedia page on these models is quite good as a quick reference: https://en.wikipedia.org/wiki/Models_of_DNA_evolution. “Markov Processes for Stochastic Modeling” by Masaaki Kijima is a math reference.
If the substitution rate is \(\mu\) substitutions per unit time, then the mean number of substitutions in time \(t\) is \(\mu t\). The CTMC has a rate matrix \(Q\) whose off-diagonal entries are the instantaneous rates of transition \(A -> C\), \(A -> T\), \(C -> G\) and so on, and these are all equal to \(\mu/4\). The transition matrix is \(P(t) = e^{Qt}\).
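To make this concrete, here is a small Python sketch of the JC transition matrix \(P(t)\), using its closed form (which follows from \(P(t) = e^{Qt}\) for this particular \(Q\)); all names are mine:

```python
import math

def jc_P(mu, t):
    """Jukes-Cantor transition matrix P(t) = exp(Qt), via its closed form.

    Q has off-diagonal entries mu/4 and diagonal entries -3*mu/4, which gives
    P_ii(t) = 1/4 + 3/4 * exp(-mu*t) and P_ij(t) = 1/4 - 1/4 * exp(-mu*t).
    """
    same = 0.25 + 0.75 * math.exp(-mu * t)
    diff = 0.25 - 0.25 * math.exp(-mu * t)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

P = jc_P(mu=1.0, t=0.5)
for row in P:                      # each row is a distribution over A, C, G, T
    assert abs(sum(row) - 1.0) < 1e-12
print(round(P[0][0], 4), round(P[0][1], 4))  # 0.7049 0.0984
```

As \(t \rightarrow \infty\), every entry approaches the stationary frequency 1/4, so two sequences eventually look unrelated.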
Consider the probability that a change from state \(i\) to state \(j\) is observed in time \(t\), if the substitution rate is \(\mu\). There is probability \(e^{-\mu t}\) that no event happens in time \(t\). The rate of leaving each state is \(3\mu/4\), since 3 of the 4 possible “changes” from a base (to A, C, T or G, one of which is the base itself) arrive at a different base. So the expected number of changes in time \(t\) is \(v = \tfrac{3}{4}\mu t\), and we can write \(\mu t = \tfrac{4}{3} v\).
\[ P(i \rightarrow j) = P( i\rightarrow j | \text{no event}) P(\text{no event}) + P( i\rightarrow j | \text{event}) P(\text{event}) \]
\[ P(i \rightarrow j) = 0 \cdot P(\text{no event}) + \tfrac{1}{4} (1-e^{-\mu t}) = \tfrac{1}{4} - \tfrac{1}{4} e^{-\mu t}\]

The probability of any change at the site is 3 times this, because there are 3 choices of base that are different from the current one:

\[ P(\text{change}) = \tfrac{3}{4} - \tfrac{3}{4} e^{-\mu t} = \tfrac{3}{4} - \tfrac{3}{4} e^{-4v/3}\]

People use this relationship to relate the model distance \(v\) between two sequences to the empirical fraction of sites at which they differ, \(P(\text{change}) = d_s\), by inverting it.
The Jukes-Cantor model’s distance is
\[ d_{JC} = -\frac{3}{4}\ln(1-\frac{4}{3}d_s) \]
where \(d_s\) is the same as above (number of differences / sequence length).
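A quick Python check of this correction (the function name `jc_distance` is mine), applied to the example sequences from earlier, which are \(d_s = 0.2\) apart:

```python
import math

def jc_distance(d_s):
    """Jukes-Cantor corrected distance from the raw per-site distance d_s."""
    if d_s >= 0.75:
        raise ValueError("d_s >= 3/4: sequences are saturated under JC")
    return -0.75 * math.log(1 - (4.0 / 3.0) * d_s)

# The correction always increases the distance, accounting for the
# possibility of multiple changes at the same site:
print(jc_distance(0.2))  # about 0.233 substitutions per site, > 0.2
```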
There is a pdf at this link that does a nice job of explaining how the Jukes-Cantor model is related to sequence distances http://www.montefiore.ulg.ac.be//Users/ccolijnkvansteen/GBIO0009-1/ac20132014/T4/jc.pdf.
JC: ACTG all equally frequent and all changes equally likely
K80: (Kimura 1980) ACTG all equally frequent, but transitions are more likely than transversions
F81 (Felsenstein 1981): as with JC but ACTG are not all equally frequent
HKY85: (Hasegawa, Kishino and Yano 1985) Transitions and transversions are not equally likely, and ACTG are not all equally frequent
In practice, these models are all implemented within software packages that we use to build phylogenetic trees.
Distances are defined using sequence evolution models
Idea: Join closest tips, make new node, repeat!
Details: differing clustering algorithms:
Single linkage, complete linkage, UPGMA
Software: bionj, nj in R
reference: Saitou and Nei 1987
Intuition: Start with all the tips separate. Create a new node that joins the two that are closest to each other. Then define the distances between these two tips and the new node. Then define the distances between the new node and all the other tips. Forget about those two tips, but keep the new node. Repeat until everything is joined up.
Suppose there are \(n\) tips and the distance between \(i\) and \(j\) is \(D_{ij}\).
To make our description precise, we need to answer three questions: (1) how do we choose which pair to join up next? (2) When we make a new node, how far is it from the two tips it joins? (3) How far is the new node from all the other tips?
Two definitions before we start
\[Q_{ij} = (n-2) D_{ij} - \sum_{k=1}^n D_{ik} - \sum_{k=1}^n D_{jk}\]
If \(E\) is the new node joining say \(A\) and \(B\), let
\[ D_{EA} = \frac{1}{2}D_{AB} + \frac{1}{2(n-2)}\left(\sum_{k=1}^n D_{Ak} -\sum_{k=1}^n D_{Bk} \right)\]
The NJ algorithm
Begin with a collection of sequences, unlinked. Compute the distance matrix \(D_{i,j}\).
While there remain nodes that are not joined up:
1. Calculate \(Q_{ij}\) for all pairs \((i,j)\), using the current distance matrix (starting from the distance matrix on all the tips).
2. Find the pair of taxa \((A,B)\) with the lowest \(Q\) value, \(Q_{A,B}\). If there’s more than one such pair, choose one at random. The \(Q\) criterion comes from wanting to choose \(A\) and \(B\) to minimize the total branch length of the tree in the next step. See Saitou and Nei 1987: https://doi.org/10.1093/oxfordjournals.molbev.a040454.
3. Create a new node, temporarily called \(E\). This node will join \(A\) and \(B\).
4. Use the formula for \(D_{EA}\) above to compute the distance between \(E\) and \(A\). Do the same for \(D_{EB}\).
5. Calculate the distance between \(E\) and all the other nodes in the tree (which are not \(A\) or \(B\)). This is easy given the distance matrix and the result of step 4: for any other node \(k\), \(D_{Ek} = \tfrac{1}{2}(D_{Ak} + D_{Bk} - D_{AB})\).
6. Update the distance matrix: remove the rows and columns corresponding to \(A\) and \(B\) and add a row and column for node \(E\). This way, the size of the distance matrix shrinks by 1.
Repeat 1 - 6. Eventually there are only two nodes left to join.
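The steps above can be sketched as a single NJ join in Python (a sketch, not a full implementation; the function name and the small illustrative 5-taxon distance matrix are mine):

```python
def nj_join_once(D):
    """Perform one neighbour-joining step on distance matrix D (list of lists).

    Returns the joined pair (i, j), their branch lengths to the new node E,
    and the updated (smaller) matrix with E as the last row/column.
    """
    n = len(D)
    row_sums = [sum(row) for row in D]

    # Steps 1-2: Q_ij = (n-2) D_ij - sum_k D_ik - sum_k D_jk; take the lowest.
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            Q = (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
            if best is None or Q < best[0]:
                best = (Q, i, j)
    _, a, b = best

    # Steps 3-4: branch lengths from A and B to the new node E.
    d_ea = 0.5 * D[a][b] + (row_sums[a] - row_sums[b]) / (2 * (n - 2))
    d_eb = D[a][b] - d_ea

    # Steps 5-6: distances from E to every other node, then shrink the matrix.
    others = [k for k in range(n) if k not in (a, b)]
    d_e = [0.5 * (D[a][k] + D[b][k] - D[a][b]) for k in others]
    newD = [[D[i][j] for j in others] + [d_e[r]] for r, i in enumerate(others)]
    newD.append(d_e + [0.0])
    return (a, b), (d_ea, d_eb), newD

# A small illustrative additive distance matrix on 5 taxa:
D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]
pair, (d_ea, d_eb), newD = nj_join_once(D)
print(pair, d_ea, d_eb)  # (0, 1) 2.0 3.0
```

In practice you would call this in a loop until two nodes remain, recording each join to build the tree.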
NJ strengths:

- Can use a realistic substitution model
- Computationally efficient: no need to compare lots of trees for an optimality condition
- Great for large datasets and short distances
- Tree-like distances: if the distances are perfect (and reflect the true tree), the tree is guaranteed to be correct

NJ weaknesses:

- If the distances aren’t perfect, the tree may well be wrong
- In some circumstances the branch lengths can be negative!
- The method is sensitive to gaps in the sequences
- The model does not allow different sites to have different “matches” to the tree – they all just contribute to the distance
- Distance measures may be inaccurate for large distances
Based on character/sequence evolution on tree
Find the tree that minimises the number of changes you need to describe each character
Here, the tree on the right requires 4 changes. The one on the left only needs 2. So the one on the left is more parsimonious.
Strengths: simple to score and understand; computationally efficient (because it’s simple).

Weaknesses: lack of explicit assumptions (e.g. hypervariable sites? sites with many changes?); no branch length or timing information; not statistically consistent.

Software: PAUP, MEGA, TNT
Likelihood: probability (or likelihood) of the data given the parameters of an underlying model, \(L(D | M)\)
These are computed using the same continuous time Markov chains described above: Jukes-Cantor and so on.
These CTMCs give the probability that (for example) an A evolved from a C on a branch of length \(t\), \(P_{a,c}(t)\).
For large samples: max likelihood estimates of parameters (MLEs) are unbiased, consistent and efficient
Felsenstein pruning algorithm makes it feasible to sum over the unknown states at the internal nodes
For example, the likelihood on the little demo tree for site 1 is the likelihood that the root (node 1) is in state \(s1\), that it evolves from there to the state observed at taxon C over branch length \(v_c\), and that \(s1\) evolves to \(s2\) over branch length \(v_{12}\). Node 2 then has to evolve from state \(s2\) to state ‘a’ at site 1, on the branch between node 2 and taxon A. And so on.
Just as \(P_{ij}(t)\) was the probability that \(i \rightarrow j\) in time \(t\) in the JC model, here, define \(P_{s1, s2}(v)\) to be the probability that state \(s1\) transitions to state \(s2\) on a branch of length \(v\). Let \(\pi_{s1}\) be the probability of state \(s1\); in the Jukes Cantor model \(\pi_{s1}=1/4\) for all four states.
Since we don’t know the sequences at the internal nodes ‘node 1’ and ‘node 2’, they could be any of the 4 bases.
\[ L(\text{site 1}) = \sum_{s1\in \{a,c,t,g\}}\sum_{s2\in \{a,c,t,g\}} \pi_{s1} P_{s1, a}(v_c) P_{s1, s2}(v_{12}) P_{s2,a}(v_a) P_{s2,a}(v_b) \]
The key observation in Felsenstein’s pruning algorithm is that we can group these sums together:
\[ L(\text{site 1}) = \sum_{s1\in \{a,c,t,g\}}\pi_{s1} P_{s1, a}(v_c)\sum_{s2\in \{a,c,t,g\}} P_{s1, s2}(v_{12}) P_{s2,a}(v_a) P_{s2,a}(v_b) \]
We can do this for every single site (in this example, there are 4 sites) to get the tree likelihood:
\[ L(\text{data}|\text{tree}) = \prod_j L(\text{site j}) \]
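The grouping trick can be checked numerically. Below is a Python sketch for one site on the demo tree (root node 1 with children taxon C and node 2; node 2 with children taxa A and B; the observed character is ‘a’ at all three tips). The branch lengths are illustrative values of my choosing, and the JC transition probability uses the closed form derived earlier:

```python
import math

STATES = range(4)  # a, c, g, t coded as 0..3

def P(i, j, v):
    """JC transition probability from state i to state j on a branch of
    length v, with v measured in expected events (mu * t)."""
    e = math.exp(-v)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

A = 0  # observed state 'a' at taxa A, B and C for this site
v_c, v_12, v_a, v_b = 0.3, 0.1, 0.2, 0.25  # illustrative branch lengths

# Naive double sum over the unknown internal states s1 (root) and s2 (node 2):
naive = sum(0.25 * P(s1, A, v_c) * P(s1, s2, v_12) * P(s2, A, v_a) * P(s2, A, v_b)
            for s1 in STATES for s2 in STATES)

# Felsenstein's grouping: the s2 sum is done once inside each s1 term:
pruned = sum(0.25 * P(s1, A, v_c)
             * sum(P(s1, s2, v_12) * P(s2, A, v_a) * P(s2, A, v_b) for s2 in STATES)
             for s1 in STATES)

assert abs(naive - pruned) < 1e-12  # identical likelihoods, fewer operations
print(naive)
```

On a large tree the grouping is applied recursively from the tips to the root, which is what makes the likelihood computation feasible.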
Statistically, the topology is a model and the rates of substitution and so on are parameters. One of the huge triumphs of maximum likelihood methods is the ability to estimate these parameters by computing the above likelihood and maximizing it. This requires searching over the space of possible trees, which is enormous.
Central assumptions used in defining the likelihood:
- Independent evolution at different sites
- Independent evolution in different branches
\[L = \prod_{i \in sites} L_i(D_i |M)\]
where \(D_i\) is the data for the characters at site \(i\)
Real datasets may break both assumptions! So be careful
See Felsenstein 1981, Felsenstein book: Inferring phylogenies
Software: RAxML, PhyML, R (phangorn), IQtree, FastTree
Strengths: rich repertoire of sophisticated models. Can estimate parameters of underlying models. This means we can estimate the mutation rate, transition/transversion ratio, etc.
Remember the demo tree of organisms? ML methods allow us to estimate what the sequence (or the phenotype, if we use phenotype data) was like at internal points in the tree.
Weaknesses: primarily computational cost.
Bayes’ theorem: \(P(M|D) = \frac{P(D|M) P(M)}{P(D)}\) where M: model, D: data.
\(P(D|M)\) is the same as the likelihood
\(P(M)\) : prior information
Use MCMC to generate a sample from the posterior
Software: Beast, MrBayes
Strengths: Clear re: uncertainty. The posterior collection of trees includes trees supported by the data, with tree topologies more often included if they are better supported.
Note the very different philosophy to ML: there are many flexible models of evolution (as in ML), and parameters can be estimated (as in ML), but Bayesian approaches also take prior knowledge into account.
Weaknesses:

- Posterior probabilities can appear too high. Why? The method is sensitive to model violations
- Prior knowledge is hard to obtain; people use the defaults in software, but innocent-looking priors can really drive the posterior
- Computationally slow; intractable past a few hundred sequences
Three things to consider:
- Consistency: approaches true value if the amount of data grows. Parsimony may not be consistent. MLEs of parameters usually are
- Efficiency: unbiased estimate with lowest variance. Usually people look at the probability of recovering the correct tree as the number of sites increases
- Robustness: correct answer even if assumptions are violated?
You will probably also consider speed! Broadly: ML beats parsimony for molecular data
- Resample columns of your data (so some are left out, others repeated)
- Repeat your tree reconstruction
- Perform this many times
- For each edge in your original tree: how many times does its descending clade occur among your bootstrap trees?
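The column-resampling step can be sketched in a few lines of Python (the alignment representation and function name are mine; in practice you would use the bootstrap options built into tree-building software):

```python
import random

def bootstrap_alignment(alignment, rng=random):
    """Resample alignment columns with replacement: one bootstrap replicate."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Some columns are repeated, others left out entirely:
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}

aln = {"A130": "acgtacgtac", "A131": "acgaacgtac", "A132": "tcgtacctac"}
rep = bootstrap_alignment(aln)
# The replicate has the same organisms and the same number of sites,
# and every one of its columns is a column of the original alignment:
assert set(rep) == set(aln)
assert all(len(s) == 10 for s in rep.values())
```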
Bootstrap numbers don’t have a clear statistical interpretation. But bootstrapping is very important to understand tree uncertainty.
Not so good if sites aren’t independent. Why? Because you get similar data on different bootstrap replicates, more than you would if the sites were independent. So nodes can have “high bootstrap values” (i.e. a split in a tree occurs often), not because the tree is a particularly great model but because the bootstrap resampling isn’t doing what it would if the independence condition were met.
Yang Z, Rannala B. Molecular phylogenetics: principles and practice. Nat Rev Genet. 2012;13: 303–314. doi:10.1038/nrg3186
Felsenstein book: Inferring Phylogenies https://www.amazon.co.uk/Inferring-Phylogenies-Joseph-Felsenstein/dp/0878931775
EBI course: http://www.ebi.ac.uk/training/online/course/introduction-phylogenetics
Neighbour joining revealed (Gascuel and Steel 2006) https://academic.oup.com/mbe/article/23/11/1997/1322446
Many other online resources are freely available